Leon Harper (21385662)
Thomas Newton (21365654)
Michal Jedruszczak (21440496)
These are the team members for the group project.
World of Warcraft is a popular MMORPG (massively multiplayer online role-playing game) with millions of users every month. With a player base that large, some cyberbullying and toxic behaviour is inevitable; World of Warcraft, however, is especially toxic and is often ranked as one of the most toxic gaming communities today. A 2021 survey by the ADL (Anti-Defamation League) found that 66% of adults aged 18-45 had been harassed or bullied in World of Warcraft (Hate is No Game: Harassment and Positive Social Experiences in Online Games, 2021).
The objective of this project is to create a model that can detect cyberbullying/toxicity. The World of Warcraft dataset provided to us gives the model reference examples of comments deemed to be bullying.
This will be a classification model that, when given a comment or statement, decides whether or not it is bullying by detecting certain words and phrases.
Currently, most video games offer an option to filter chat, but this only censors certain words or phrases. Our model will be able to detect strings of words rather than just individual ones.
League of Legends is a popular MOBA (multiplayer online battle arena) video game with millions of users every month. With a player base that large, some cyberbullying and toxic behaviour is inevitable; League, however, is especially toxic and is often ranked as one of the most toxic gaming communities today. As of September 2022, the ban rate across all accounts is 2.25% (around 2,632,500 accounts). A 2021 survey by the ADL (Anti-Defamation League) found that 65% of adults aged 18-45 had been harassed or bullied in League of Legends.
The objective of this project is to create a model that can detect cyberbullying/toxicity. The League of Legends dataset provided to us gives the model reference examples of comments deemed to be bullying.
This will be a classification model that, when given a comment or statement, decides whether or not it is bullying by detecting certain words and phrases.
Currently, most video games offer an option to filter chat, but this only censors certain words or phrases. Our model will be able to detect strings of words rather than just individual ones.
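As a toy illustration of this difference (the blocklist and messages below are invented examples, not taken from the project data), a word-level filter can only match fixed terms, so a toxic phrase built from harmless words slips through:

```r
# Hypothetical blocklist and messages -- illustrative only
blocklist <- c("idiot", "stupid")
messages <- c(
  "you played well",
  "you are an idiot",             # contains a blocklisted word
  "go uninstall and never queue"  # toxic phrase, no blocklisted word
)

# A word-level filter flags a message only if it contains a blocklisted word
word_filter_hit <- sapply(messages, function(m) {
  any(sapply(blocklist, function(w) grepl(paste0("\\b", w, "\\b"), m)))
})
word_filter_hit  # only the second message is flagged
```

The third message evades the filter entirely, which is the gap that a classifier trained on whole sequences of words aims to close.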
ADL Survey: Hate is No Game: Harassment and Positive Social Experiences in Online Games 2021 (adl.org)

# Step 2: Importing data
wow_posts_df <- read.csv("Data/posts_wow.csv")
wow_annotations_df <- read.csv("Data/annotations_wow.csv")
lol_posts_df <- read.csv("Data/posts_lol.csv")
lol_annotations_df <- read.csv("Data/annotations_lol.csv")
This imports the required data for the project. The data was exported from an SQL script that creates the necessary tables (i.e. posts and annotations) and inserts the data. To simplify importing, we used custom SQL to export the SQL table data into csv files, as the MySQL Workbench Table Export Wizard doesn’t export all of the data properly.
We performed a preliminary EDA to understand what kind of data we are dealing with and how it should be cleaned.
In our League of Legends posts dataset, we are given 16867 rows of 5 features: topic_id, post_number, author, html_message and timestamp. For the purposes of this project, we will ignore timestamp, as it will not be used to train our models. The additional LoL annotations dataset reveals which messages were flagged as cyberbullying, from which the only useful features are the post_number and topic_id. This information has been combined into a single file – clean_posts_balanced_sample.csv.
Posts in the main dataset are formatted using HTML, meaning that our model will either have to be trained to recognise patterns such as paragraph breaks or we will have to clean the data and transform it into something more appropriate.
Furthermore, as we are only using a single set of features (post_number and topic_id) from the annotations dataset, we may be able to add another column to the posts dataset – a Boolean value representing whether or not a specific post contains cyberbullying. This will act as the target variable for our model building process.
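As a minimal sketch of this idea (using hypothetical miniature data frames; the actual implementation appears in the pre-processing section of this report), a post is labelled as bullying when its (topic_id, post_number) pair appears in the annotations:

```r
# Hypothetical miniature versions of the two datasets -- illustrative only
posts <- data.frame(topic_id     = c(1, 1, 2),
                    post_number  = c(1, 2, 1),
                    html_message = c("hello", "you are awful", "nice game"))
annotations <- data.frame(topic_id = 1, post_number = 2)  # the flagged post

# A post is bullying if its (topic_id, post_number) pair has an annotation
posts$is_bullying <- as.integer(
  paste(posts$topic_id, posts$post_number) %in%
    paste(annotations$topic_id, annotations$post_number)
)
posts$is_bullying  # -> 0 1 0
```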
Regarding the types of data we’re given, topic IDs, authors and HTML messages are categorical, while post numbers are ordinal. In the annotations dataset, all values but post number are categorical.
summary(lol_posts_df)
## topic_id post_number author html_message
## Min. : 2.0 Min. : 0 Length:16867 Length:16867
## 1st Qu.: 33.0 1st Qu.: 364 Class :character Class :character
## Median : 140.0 Median : 953 Mode :character Mode :character
## Mean : 604.6 Mean :1393
## 3rd Qu.:1090.0 3rd Qu.:1998
## Max. :2481.0 Max. :5358
## timestamp
## Length:16867
## Class :character
## Mode :character
##
##
##
nrow(lol_posts_df)
## [1] 16867
There are 16867 data points in our dataset.
lol_posts_df %>% select("topic_id") -> lol_topics #gets all unique lol_topics from the dataset
nrow(unique(lol_topics))
## [1] 17
This data set has a total of 17 different topics.
annotated_messages_lol <- merge(x = lol_posts_df, y = lol_annotations_df, by = c("topic_id","post_number"), all = TRUE) #outer join lol_posts_df and lol_annotations_df
annotated_messages_lol %>% filter(annotator %in% c("1","2","3")) -> annotated_messages_lol
annotated_messages_lol[!duplicated(annotated_messages_lol[3]),] -> annotated_messages_lol
annotated_messages_lol %>% drop_na() -> annotated_messages_lol
annotated_messages_lol
This code shows all annotated messages (messages that have been marked as cyberbullying). We perform an outer join between lol_posts_df and lol_annotations_df, keep only the rows that have an annotator, remove duplicate messages and finally drop null rows.
100 * count(annotated_messages_lol)/count(lol_posts_df)#percentage of posts marked as cyberbullying
For this data set 0.9% of the posts are labeled as bullying. To make the data less skewed, we need to take a balanced sample of bullying and non-bullying messages.
lol_annotations_df %>% select("annotator") -> lol_ann #gets all unique annotators from the annotations dataset
nrow(unique(lol_ann))
## [1] 3
We can see that there are three unique annotators for this dataset.
lol_annotations_df %>% select("offender") -> lol_off #gets all unique offenders from the annotations dataset
nrow(unique(lol_off))
## [1] 152
152 unique offenders have been identified by the annotators.
In our World of Warcraft posts dataset, we are given 16978 rows of 5 features: topic_id, post_number, author, html_message and timestamp. For the purposes of this project, we will ignore timestamp, as it will not be used to train our models. The additional WoW annotations dataset reveals which messages were flagged as cyberbullying, from which the only useful features are the post_number and topic_id. This information has been combined into a single file – clean_posts_balanced_sample.csv.
Posts in the main dataset are formatted using HTML, meaning that our model will either have to be trained to recognise patterns such as paragraph breaks or we will have to clean the data and transform it into something more appropriate.
Furthermore, as we are only using a single set of features (post_number and topic_id) from the annotations dataset, we may be able to add another column to the posts dataset – a Boolean value representing whether or not a specific post contains cyberbullying. This will eliminate the need for the use of two separate datasets to develop our models.
Regarding the types of data we’re given, topic IDs, authors and HTML messages are categorical, while post numbers are ordinal. In the annotations dataset, all values but post number are categorical.
summary(wow_posts_df)
## topic_id post_number author html_message
## Length:16978 Min. : 0 Length:16978 Length:16978
## Class :character 1st Qu.: 226 Class :character Class :character
## Mode :character Median : 708 Mode :character Mode :character
## Mean :1366
## 3rd Qu.:2420
## Max. :4692
## NA's :3
## timestamp
## Length:16978
## Class :character
## Mode :character
##
##
##
##
nrow(wow_posts_df)
## [1] 16978
There are 16978 data points in our dataset.
wow_posts_df %>% select("topic_id") -> wow_topics #gets all unique wow_topics from the dataset
nrow(unique(wow_topics))
## [1] 23
This data set has a total of 23 different topics.
annotated_messages_wow <- merge(x = wow_posts_df, y = wow_annotations_df, by = c("topic_id","post_number"), all = TRUE) #outer join wow_posts_df and wow_annotations_df
annotated_messages_wow %>% filter(annotator %in% c("1","2","3")) -> annotated_messages_wow
annotated_messages_wow[!duplicated(annotated_messages_wow[3]),] -> annotated_messages_wow
annotated_messages_wow %>% drop_na() -> annotated_messages_wow
annotated_messages_wow
This code shows all annotated messages (messages that have been marked as cyberbullying). We perform an outer join between wow_posts_df and wow_annotations_df, keep only the rows that have an annotator, remove duplicate messages and finally drop null rows.
100 * count(annotated_messages_wow)/count(wow_posts_df)#percentage of posts marked as cyberbullying
For this data set 0.91% of the posts are labeled as bullying. To make the data less skewed, we need to take a balanced sample of bullying and non-bullying messages.
wow_annotations_df %>% select("annotator") -> wow_ann #gets all unique annotators from the annotations dataset
nrow(unique(wow_ann))
## [1] 3
We can see that there are three unique annotators for this dataset.
wow_annotations_df %>% select("offender") -> wow_off #gets all unique offenders from the annotations dataset
nrow(unique(wow_off))
## [1] 102
102 unique offenders have been identified by the annotators.
For pre-processing, we decided to merge the datasets, as they are essentially identical in structure; this simplifies pre-processing and removes code duplication. We also ignored the annotation csv files (apart from feature extraction), as there wasn’t anything in them to clean (we would just be cleaning metadata). Having merged the datasets, we created an “is_bullying” column to serve as the target variable for the model building process. We also created a “bullying_severity” column, as each cyberbullying post can be flagged by multiple annotators.
After that, we extracted the messages out of the HTML (both datasets come from web forums) and pre-processed them through stemming, removing punctuation and removing stopwords. We then derived a “word_counts” column for EDA purposes and sampled the data to balance it. Next, we created training and test datasets for model building, where each dataset is a document term matrix, as we cannot pass raw text to the models. Finally, we export the relevant data (including the clean data).
# Creates dataset column to merge posts and annotations csv files together
wow_posts_df$dataset <- "WoW"
lol_posts_df$dataset <- "LoL"
wow_annotations_df$dataset <- "WoW"
lol_annotations_df$dataset <- "LoL"
posts_df <- rbind(wow_posts_df, lol_posts_df)
annotations_df <- rbind(lol_annotations_df, wow_annotations_df)
Since wow_posts_df and lol_posts_df have the same structure, we merged the posts and annotation data frames together to simplify pre-processing (this avoids repeating code). However, we will still need to analyse the datasets separately for EDA purposes, so we created a “dataset” feature to make that possible.
posts_df$id <- paste(posts_df$dataset, posts_df$topic_id, posts_df$post_number, sep="_")
annotations_df$id <- paste(annotations_df$dataset, annotations_df$topic_id, annotations_df$post_number, sep="_")
To simplify the merging of data frames, we will create an ID column so that a left join can be performed on a single column. This mitigates the issues of duplicate topic ids and post numbers as the post numbers are only unique according to the topic id.
merged_df <- left_join(posts_df, annotations_df, by = "id", keep=TRUE)
merged_df$is_bullying <- as.integer(!is.na(merged_df["id.y"]))
drop <- c('topic_id.y', 'post_number.y', 'dataset.y', 'id.y', 'offender', 'victim')
merged_df <- merged_df[, !(names(merged_df) %in% drop)]
# Removes the ".x" suffix from the remaining merged columns
colnames(merged_df) = sub("\\.x$", "", colnames(merged_df))
# Create bullying_severity column
names(merged_df)[names(merged_df) == "annotator"] <- "bullying_severity"
merged_df["bullying_severity"][is.na(merged_df["bullying_severity"])] <- 0
posts_df <- merged_df %>% group_by(id) %>% slice(which.max(bullying_severity))
This code performs a left join to merge the dataframes together. Most of the columns from the annotations dataframe are not useful for training an NLP classifier, so we remove them. Since there are duplicate columns on each side of the join, we drop the “.y” columns.
We also created a bullying_severity column as we found that some posts have been annotated as bullying by multiple annotators which could make this a useful feature for model building.
remove_html <- function(html_msg, isHtml) {
  if (isHtml) {
    # Remove backslashes when dealing with LoL forum data
    html_msg <- gsub("\\\\", '', html_msg)
    # Get XML nodes
    msg <- xml2::read_html(html_msg)
    # Get the block quotes and quotes (blockquotes for WoW, .quote for LoL)
    blockquotes <- msg %>% html_nodes("blockquote")
    quotes <- msg %>% html_nodes(".quote")
    # Remove quote elements for LoL and WoW datasets
    xml_remove(blockquotes)
    xml_remove(quotes)
    msg <- html_text(msg)
    return(msg)
  }
  return(html_msg)
}
The “html_message” column contains messages both with and without HTML. To handle this, we create an “is_html” column that uses a regular expression to detect HTML, which prevents errors with rvest. The “tm” package does not handle removing HTML content, and we cannot simply strip tags with a regular expression because the data originates from gaming forums where blockquote elements are frequently used; a regular expression would remove the tags but leave the quoted content inside the blockquotes behind.
To remove the content of the blockquotes, we use rvest to acquire the blockquote elements (as well as any elements with the “.quote” class for the LoL data) and then use xml_remove() to delete those nodes. We then convert the rvest object back into a string.
# Regex for detecting HTML
detect_html_regex <- "<.*?>"
# Create is_html column
posts_df$is_html <- str_detect(posts_df$html_message, detect_html_regex)
# Apply remove_html function to html_message
posts_df$html_message <- mapply(remove_html, posts_df$html_message, posts_df$is_html)
posts_df <- posts_df[, !(names(posts_df) %in% 'is_html')]
# Removes any text matching the regex passed into the transformer
toSpaceTransformer <- content_transformer(function (x, pattern) gsub(pattern, "", x))
posts_corpus <- Corpus(VectorSource(posts_df$html_message))
posts_corpus <- posts_corpus %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(toSpaceTransformer, "http\\S+\\s*") %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(removePunctuation) %>%
  tm_map(stemDocument) %>%
  tm_map(stripWhitespace)
posts_df$html_message <- data.frame(text=sapply(posts_corpus, identity), stringsAsFactors = F)$text

This code removes useless characters, stopwords and punctuation, and uses stemming to improve model performance. Certain pre-processing steps could be tweaked to improve model performance (e.g. the number of stopwords being omitted), as the pre-processing could end up being too rigorous.
We removed the HTML characters first in order to prevent interference when removing punctuation or whitespace.
Word counts
posts_df$word_counts <- str_count(posts_df$html_message, "\\S+")

This code gets the word counts for the html messages, which can be used for analysing word counts in the EDA. We may also use the word counts to filter out messages with word counts that are too low.
posts_df <- posts_df %>% na_if("") %>% na.omit

This code removes NA rows from posts_df, which can appear after pre-processing if the original messages consisted mostly of stop words.
write.csv(posts_df, file="Data/clean_posts.csv")

This code exports the clean posts to a csv file to be analysed separately. This also saves time when performing EDAs, as pre-processing can take a while (especially on slow computers).
Creating DTM
DTM for combined dataset
corpus = VCorpus(VectorSource(posts_df$html_message))
dtm = DocumentTermMatrix(corpus)
dtm = removeSparseTerms(dtm, 0.999)
posts_data = as.data.frame(as.matrix(dtm))
posts_data$is_bullying = as.factor(posts_df$is_bullying)

DTM for separate datasets
lol_posts <- posts_df %>% filter(dataset=="LoL")
wow_posts <- posts_df %>% filter(dataset=="WoW")
lol_corpus = VCorpus(VectorSource(lol_posts$html_message))
lol_dtm = DocumentTermMatrix(lol_corpus)
lol_dtm = removeSparseTerms(lol_dtm, 0.999)
lol_posts_data = as.data.frame(as.matrix(lol_dtm))
lol_posts_data$is_bullying = as.factor(lol_posts$is_bullying)
wow_corpus = VCorpus(VectorSource(wow_posts$html_message))
wow_dtm = DocumentTermMatrix(wow_corpus)
wow_dtm = removeSparseTerms(wow_dtm, 0.999)
wow_posts_data = as.data.frame(as.matrix(wow_dtm))
wow_posts_data$is_bullying = as.factor(wow_posts$is_bullying)

We create a document term matrix from the html messages and remove sparse terms (terms that appear in very few documents) using removeSparseTerms. We then assign an “is_bullying” column for model building and EDA purposes.
ggplot(data=posts_data, aes(x=is_bullying)) + geom_bar()
ggplot(data=wow_posts_data, aes(x=is_bullying)) + geom_bar()
ggplot(data=lol_posts_data, aes(x=is_bullying)) + geom_bar()
As we can see, the data is heavily imbalanced: there are very few bullying cases. This would train the classifier to be more accurate at classifying non-bullying cases than bullying cases. We will use undersampling because we have plenty of non-bullying data but not enough bullying data (meaning we can afford to reduce the total amount of data we are dealing with).
Sampling the data
League of Legends dataset
is_bullying = which(lol_posts_data$is_bullying == 1)
not_bullying = which(lol_posts_data$is_bullying == 0)
nsamp = min(length(is_bullying), length(not_bullying))
sample_bullying = sample(is_bullying, nsamp)
sample_not_bullying = sample(not_bullying, nsamp)
lol_posts_data_balanced = lol_posts_data[c(sample_bullying, sample_not_bullying),]

World of Warcraft dataset
is_bullying = which(wow_posts_data$is_bullying == 1)
not_bullying = which(wow_posts_data$is_bullying == 0)
nsamp = min(length(is_bullying), length(not_bullying))
sample_bullying = sample(is_bullying, nsamp)
sample_not_bullying = sample(not_bullying, nsamp)
wow_posts_data_balanced = wow_posts_data[c(sample_bullying, sample_not_bullying),]

Combined dataset
is_bullying = which(posts_data$is_bullying == 1)
not_bullying = which(posts_data$is_bullying == 0)
nsamp = min(length(is_bullying), length(not_bullying))
sample_bullying = sample(is_bullying, nsamp)
sample_not_bullying = sample(not_bullying, nsamp)
posts_data_balanced = posts_data[c(sample_bullying, sample_not_bullying),]

This creates a balanced sample of the bullying and non-bullying data. However, this comes at the expense of having much less data to work with, as there are significantly fewer cyberbullying cases than non-cyberbullying cases.
ggplot(data=posts_data_balanced, aes(x=is_bullying)) + geom_bar()
ggplot(data=wow_posts_data_balanced, aes(x=is_bullying)) + geom_bar()
ggplot(data=lol_posts_data_balanced, aes(x=is_bullying)) + geom_bar()
Train/test split
set.seed(42)
part <- sample(2, nrow(posts_data), replace=TRUE, prob=c(0.6, 0.4))
train <- posts_data[part == 1, ]
test <- posts_data[part == 2, ]

We split the data using a 60:40 train-test split.
Combined data
set.seed(42)
part <- sample(2, nrow(posts_data_balanced), replace=TRUE, prob=c(0.6, 0.4))
train_balanced <- posts_data_balanced[part == 1, ]
test_balanced <- posts_data_balanced[part == 2, ]

We split the data using a 60:40 split. This is for the balanced data.
League of Legends dataset
set.seed(42)
part <- sample(2, nrow(lol_posts_data_balanced), replace=TRUE, prob=c(0.6, 0.4))
lol_train_balanced <- lol_posts_data_balanced[part == 1, ]
lol_test_balanced <- lol_posts_data_balanced[part == 2, ]

World of Warcraft dataset
set.seed(42)
part <- sample(2, nrow(wow_posts_data_balanced), replace=TRUE, prob=c(0.6, 0.4))
wow_train_balanced <- wow_posts_data_balanced[part == 1, ]
wow_test_balanced <- wow_posts_data_balanced[part == 2, ]

Exporting cleaned data
write.csv(posts_data_balanced, file="Data/Balanced/clean_posts_dtm_balanced_sample.csv")
write.csv(wow_train_balanced, file="Data/Balanced/clean_wow_train_balanced_sample.csv")
write.csv(wow_test_balanced, file="Data/Balanced/clean_wow_test_balanced_sample.csv")
write.csv(lol_train_balanced, file="Data/Balanced/clean_lol_train_balanced_sample.csv")
write.csv(lol_test_balanced, file="Data/Balanced/clean_lol_test_balanced_sample.csv")
write.csv(train_balanced, file="Data/Balanced/train_balanced.csv")
write.csv(test_balanced, file="Data/Balanced/test_balanced.csv")
write.csv(train, file="Data/train.csv")
write.csv(test, file="Data/test.csv")
write.csv(posts_data, file="Data/clean_posts_dtm.csv")

We export the training and test data to make steps such as model building and EDA easier and more convenient, as pre-processing the data can take time (especially on slower computers).
posts_df %>% filter(dataset == "LoL") -> lol_posts
lol_posts %>% filter(is_bullying == 1) -> bullying_lol
Here we filter out our LoL data, and from that data we filter out all the messages that have been marked as cyberbullying. This allows us to calculate the percentage of cyberbullying cases in our dataset.
nrow(bullying_lol) / nrow(lol_posts) * 100
## [1] 1.530429
For this dataset 1.53% of the posts are labeled as bullying.
lol_tibble <- tibble(txt = lol_posts$html_message)
lol_tibble #transforming the HTML messages into a tibble for an easier workflow
lol_tibble <- lol_tibble%>%
mutate(linenumber = row_number()) %>%
unnest_tokens(word, txt) %>% anti_join(stop_words)
lol_tibble #splitting tibble by words
lol_counts <- lol_tibble %>% count(word, sort=TRUE)
lol_counts #sorting words by count
wordcloud(lol_counts$word, lol_counts$n, max.words = 250,
min.freq=25, random.order=FALSE, colors=brewer.pal(8, "Dark2"))
#creating a word cloud out of sorted list
By using a word cloud to analyse the HTML messages from the League of Legends boards, we have gained several insights into the issue of cyberbullying within the game.
The word cloud helped us to identify the most common words and phrases used in the chat. This allowed us to see whether certain language or terminology was frequently used in a negative or bullying context. For example, the words “stupid” and “idiot” were often marked as cyberbullying, which could indicate that players were using those words to mock or belittle others.
lol_bigrams <- tibble(txt = lol_posts$html_message) %>%
unnest_tokens(bigram, txt, token = "ngrams", n = 2)
lol_bigrams
#split original tibble into two-word bigrams
Here we split the LoL tibble into bigram tokens. This allows us to perform an n-gram analysis and see the most common relationships between words.
lol_bigrams <- lol_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
lol_bigrams #separating them out for easier cleaning
lol_bigrams <- lol_bigrams %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
lol_bigrams #filtering out unwanted words
lol_bigrams <- lol_bigrams %>%
filter(!is.na(word1)) %>%
filter(!is.na(word2))
lol_bigrams #removing null values
lol_bigrams <- lol_bigrams %>%
unite(bigram, word1, word2, sep=" ")
lol_bigrams #joining the words back together
This code splits the bigram to remove any stop words, usernames and null values before joining the bigrams back together.
lol_bigram_counts <- lol_bigrams %>% count(bigram, sort=TRUE)
lol_bigram_counts #counting and sorting bigrams
lol_bigram_counts %>%
filter(str_detect(lol_bigram_counts$bigram,"[0-9]", negate = TRUE)) -> lol_bigram_counts
lol_bigram_counts #removing any bigrams with numbers
Here we remove any bigrams that contain numbers, as they don’t provide any useful information.
lol_filtered_bigrams <- lol_bigram_counts %>%
filter(n >= 4)
lol_filtered_bigrams #sorting remaining bigrams with a frequency of 4 or more
This section filters out any bigrams with a frequency below 4. The threshold of 4 was chosen as a compromise between how many bigrams could be visualised and how much information we could get out of the graphical representation.
lol_separated_bigrams <- lol_filtered_bigrams %>%
select("bigram") %>%
separate(bigram, c("word1", "word2"), sep = " ")
lol_separated_bigrams #separating bigrams again, preparing for graphical representation
lol_bigram_graph <- lol_separated_bigrams %>%
graph_from_data_frame()
lol_bigram_graph #creating bigram graph
## IGRAPH 319ffde DN-- 1761 6031 --
## + attr: name (v/c)
## + edges from 319ffde (vertex names):
## [1] lp ->ffsgive late ->game twin ->fang play ->game
## [5] earli ->game game ->mode play ->dominion rank ->queue
## [9] play ->rank leagu ->legend peopl ->play poison ->mage
## [13] rank ->dominion summon ->rift rank ->game dominion->player
## [17] player ->base game ->play share ->account account ->share
## [21] twist ->treelin lane ->bulli win ->rate loss ->prevent
## [25] mana ->cost ap ->ratio enemi ->team lane ->phase
## [29] everi ->singl fortun ->teller noxious ->blast everi ->time
## + ... omitted several edges
ggraph(lol_bigram_graph, layout = "fr") +
geom_edge_link() +
geom_node_point() +
geom_node_text(aes(label = name), vjust = 1, hjust = 1)
For this project we used a bigram representation to quickly visualize the most common two-word phrases in this large text dataset. This helped us to get a sense of the collocation of words in the data, which gave important insights into the meaning and context of the text.
Additionally, using a bigram representation allowed us to easily identify any unusual or unexpected combinations of words in the dataset. This was particularly useful when working with unstructured data, as it let us quickly see some of the relationships between words within the text. By identifying the most common two-word phrases, we were able to see how words collocate in the dataset, which gave us an idea of how they were being used and the relationships between them. This helped us to identify patterns and themes in the data that would have been difficult to discern by looking at individual words alone.
In summary, using a bigram representation was a valuable tool for our exploratory data analysis, as it helped us to quickly gain a better understanding of our dataset by identifying patterns and themes, and also helped us to identify any unusual or unexpected combinations of words that might be present in the HTML message data.
wow_posts <- posts_df %>% filter(dataset == "WoW")
wow_posts %>% filter(is_bullying == 1) -> bullying_wow
bullying_wow
Here we filter out our WoW data sample, and from that data we filter out all the messages that have been marked as cyberbullying. This allows us to calculate the percentage of cyberbullying cases in our dataset.
nrow(bullying_wow) / nrow(wow_posts) * 100
## [1] 2.437285
For this data set 2.4% of the posts are labeled as bullying.
wow_tibble <- tibble(txt = wow_posts$html_message)
wow_tibble #transforming the HTML messages into a tibble for an easier workflow
The tibble data type is more flexible than a string, which allows us to perform a larger variety of operations on it - such as filtering.
wow_tibble <- wow_tibble%>%
mutate(linenumber = row_number()) %>%
unnest_tokens(word, txt) %>% anti_join(stop_words)
## Joining, by = "word"
wow_tibble #splitting tibble by words and removing stop words
wow_counts <- wow_tibble %>% count(word, sort=TRUE)
wow_counts #sorting and counting the remaining words
This section separates the tibble by line number and then by word. Stop words such as “it”, “his” and “the” are removed. Then the remaining words are sorted by how frequently they appear in our data.
wordcloud(wow_counts$word, wow_counts$n, max.words = 250,
min.freq=25, random.order=FALSE, colors=brewer.pal(8, "Dark2"))
#creating word cloud
We used a word cloud to quickly visualise the most common words in our large text dataset. This helped us to get a sense of the overall topic and context of the data, as well as identify any patterns or themes that might be present.
Additionally, using a word cloud allowed us to easily identify any outliers or unusual words that might be present in the dataset, which could then be further investigated. This was particularly useful when working with unstructured data, as it allowed us to quickly gain insights without having to manually sift through all of the text. Using a word cloud was a valuable tool in our exploratory data analysis and helped us to quickly gain a better understanding of the dataset.
wow_bigrams <- tibble(txt = wow_posts$html_message) %>%
unnest_tokens(bigram, txt, token = "ngrams", n = 2)
wow_bigrams #split original tibble into two-word bigrams
Here we split the WoW tibble into bigram tokens. This allows us to perform an n-gram analysis and see the most common relationships between the words used.
wow_bigrams <- wow_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
wow_bigrams #separating the words for easier cleaning
wow_bigrams <- wow_bigrams %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
wow_bigrams #filtering out unwanted words
wow_bigrams <- wow_bigrams %>%
filter(!is.na(word1)) %>%
filter(!is.na(word2))
wow_bigrams #removing null values
wow_bigrams <- wow_bigrams %>%
unite(bigram, word1, word2, sep=" ")
wow_bigrams #joining the words back together
This code splits the bigram to remove any stop words, usernames and null values before joining the bigrams back together.
wow_bigram_counts <- wow_bigrams %>% count(bigram, sort=TRUE)
wow_bigram_counts #counting and sorting bigrams
Here we display a sorted list of bigrams by their frequency.
wow_bigram_counts %>%
filter(str_detect(wow_bigram_counts$bigram,"[0-9]", negate = TRUE)) -> wow_bigram_counts
wow_bigram_counts #removing any bigrams with numbers
This code removes any bigrams containing numbers, as they don’t provide useful information for our project.
wow_filtered_bigrams <- wow_bigram_counts %>%
filter(n >= 4)
wow_filtered_bigrams #sorting remaining bigrams with a frequency of 4 or more
Here we filter the bigrams, keeping only the ones that appear 4 or more times in the dataset. The number 4 was chosen due to how many relationships can be represented at once using the ggraph library.
wow_separated_bigrams <- wow_filtered_bigrams %>%
select("bigram") %>%
separate(bigram, c("word1", "word2"), sep = " ")
wow_separated_bigrams #separating bigrams again, preparing for graphical representation
wow_bigram_graph <- wow_separated_bigrams %>%
graph_from_data_frame()
wow_bigram_graph #creating bigram graph
## IGRAPH 5c99086 DN-- 1898 8398 --
## + attr: name (v/c)
## + edges from 5c99086 (vertex names):
## [1] connect->realm realm ->connect pvp ->realm
## [4] play ->game pve ->realm mani ->peopl
## [7] classic->server vanilla->server rp ->realm
## [10] lot ->peopl classic->realm fli ->mount
## [13] popul ->realm peopl ->play argent ->dawn
## [16] hellfir->hellfir low ->popul confirm->kill
## [19] play ->wow vanilla->realm max ->level
## [22] real ->life world ->warcraft defia ->brotherhood
## + ... omitted several edges
ggraph(wow_bigram_graph, layout = "fr") +
geom_edge_link() +
geom_node_point() +
geom_node_text(aes(label = name), vjust = 1, hjust = 1)
The bigram representation helped us identify the most common two-word phrases used in the chat, showing the distribution of word pairs and how players communicate with each other. For example, the phrase “commit suicide” appears in the bigram graph, and messages containing it were often flagged as cyberbullying.
train_balanced$is_bullying = as.factor(train_balanced$is_bullying)
test_balanced$is_bullying = as.factor(test_balanced$is_bullying)
lol_train_balanced$is_bullying = as.factor(lol_train_balanced$is_bullying)
lol_test_balanced$is_bullying = as.factor(lol_test_balanced$is_bullying)
wow_train_balanced$is_bullying = as.factor(wow_train_balanced$is_bullying)
wow_test_balanced$is_bullying = as.factor(wow_test_balanced$is_bullying)
set.seed(42)
train_control = trainControl(method = "cv", number = 5)
We use cross-validation to assess how well each model generalises to unseen data and to detect overfitting; the aim is a better estimate of the model's out-of-sample performance.
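The fold construction that trainControl performs internally can be illustrated in base R; this is only a sketch of 5-fold splitting with a hypothetical row count, not caret's exact implementation:

```r
set.seed(42)
n <- 100  # pretend training set of 100 rows
k <- 5
shuffled <- sample(n)                           # random permutation of row indices
folds <- split(shuffled, rep(1:k, length.out = n))
# each fold is held out once while the model trains on the other four
lengths(folds)
```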
set.seed(42)
tic()
svm_model = caret::train(is_bullying~., data=train_balanced , method ="svmLinear" , trControl = train_control, verbose=FALSE)
svm_toc <- toc(quiet=T)
tic()
svm_wow_model = caret::train(is_bullying~., data=wow_train_balanced, method ="svmLinear" , trControl = train_control, verbose=FALSE)
svm_wow_toc <- toc(quiet=T)
tic()
svm_lol_model = caret::train(is_bullying~., data=lol_train_balanced, method ="svmLinear" , trControl = train_control, verbose=FALSE)
svm_lol_toc <- toc(quiet=T)
svm_time_taken <- svm_toc$toc - svm_toc$tic
svm_lol_time_taken <- svm_lol_toc$toc - svm_lol_toc$tic
svm_wow_time_taken <- svm_wow_toc$toc - svm_wow_toc$tic
svm_pred_y = predict(svm_model, test_balanced)
svm_wow_pred_y = predict(svm_wow_model, wow_test_balanced)
svm_lol_pred_y = predict(svm_lol_model, lol_test_balanced)
Here we train the models, with “is_bullying” as the outcome variable since this is what we want to predict. The method argument selects the type of model, and trControl supplies our cross-validation parameters.
An SVM (Support Vector Machine) is a classification algorithm designed for two-class problems. We chose an SVM here because our datasets are balanced. We also started with a linear kernel, since we believe our datasets are close to linearly separable and linear kernels are common in text classification problems.
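One reason linear kernels suit text data is that each message becomes a high-dimensional, sparse count vector, and the linear kernel is simply the dot product of two such vectors. A hedged base-R illustration with made-up term counts (not our actual features):

```r
# toy term-count vectors for two messages over a shared vocabulary
doc_a <- c(noob = 2, report = 1, gg = 0, uninstall = 1)
doc_b <- c(noob = 1, report = 0, gg = 2, uninstall = 1)
linear_kernel <- function(x, y) sum(x * y)  # K(x, y) = x . y
linear_kernel(doc_a, doc_b)
```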
set.seed(42)
tic()
mlp_model = caret::train(is_bullying~., data=train_balanced , method ="mlp" , trControl = train_control, verbose=FALSE)
mlp_toc <- toc(quiet=T)
tic()
mlp_wow_model = caret::train(is_bullying~., data=wow_train_balanced , method ="mlp" , trControl = train_control, verbose=FALSE)
mlp_wow_toc <- toc(quiet=T)
tic()
mlp_lol_model = caret::train(is_bullying~., data=lol_train_balanced , method = "mlp" , trControl = train_control, verbose=FALSE)
mlp_lol_toc <- toc(quiet=T)
mlp_time_taken <- mlp_toc$toc - mlp_toc$tic
mlp_wow_time_taken <- mlp_wow_toc$toc - mlp_wow_toc$tic
mlp_lol_time_taken <- mlp_lol_toc$toc - mlp_lol_toc$tic
mlp_pred_y = predict(mlp_model, test_balanced)
mlp_lol_pred_y = predict(mlp_lol_model, lol_test_balanced)
mlp_wow_pred_y = predict(mlp_wow_model, wow_test_balanced)
As with the SVM models, we train with “is_bullying” as the outcome variable; the method argument selects the model type and trControl supplies our cross-validation parameters.
MLPs (Multilayer Perceptrons) are very flexible and can learn a general mapping from inputs to outputs, which makes them well suited to classification prediction problems like ours. They are composed of one or more layers of neurons: data is fed to the input layer (sometimes called the visible layer), one or more hidden layers provide increasing levels of abstraction, and predictions are made at the output layer.
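The layered mapping described above can be sketched as a single forward pass in base R; the weights below are arbitrary illustrative values, not learned parameters:

```r
sigmoid <- function(z) 1 / (1 + exp(-z))
x <- c(1, 0, 1)                  # one input row with three features
W1 <- matrix(c( 0.5, -0.3,  0.8,
                0.1,  0.4, -0.2), nrow = 2, byrow = TRUE)  # two hidden neurons
b1 <- c(0.1, -0.1)
h <- sigmoid(W1 %*% x + b1)      # hidden layer: abstraction over the inputs
w2 <- c(1.2, -0.7); b2 <- 0.05
p <- sigmoid(sum(w2 * h) + b2)   # output layer: probability-like score
p
```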
set.seed(42)
tic()
glmnet_model = caret::train(is_bullying~., data=train_balanced , method = "glmnet" , trControl = train_control, verbose=FALSE)
glmnet_toc <- toc(quiet=T)
tic()
glmnet_wow_model = caret::train(is_bullying~., data=wow_train_balanced , method = "glmnet" , trControl = train_control, verbose=FALSE)
glmnet_wow_toc <- toc(quiet=T)
tic()
glmnet_lol_model = caret::train(is_bullying~., data=lol_train_balanced , method = "glmnet" , trControl = train_control, verbose=FALSE)
glmnet_lol_toc <- toc(quiet=T)
glmnet_time_taken <- glmnet_toc$toc - glmnet_toc$tic
glmnet_wow_time_taken <- glmnet_wow_toc$toc - glmnet_wow_toc$tic
glmnet_lol_time_taken <- glmnet_lol_toc$toc - glmnet_lol_toc$tic
glmnet_pred_y = predict(glmnet_model, test_balanced)
glmnet_lol_pred_y = predict(glmnet_lol_model, lol_test_balanced)
glmnet_wow_pred_y = predict(glmnet_wow_model, wow_test_balanced)
We have utilised the glmnet package as it provides extremely efficient procedures for fitting logistic regression with lasso or elastic-net regularisation. Since this is a classification problem, logistic regression is the appropriate generalised linear model.
When caret trains a model via the “glmnet” method, it fits a generalised linear model (logistic regression in this case) by penalised maximum likelihood.
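Penalised maximum likelihood means glmnet adds the elastic-net penalty to the negative log-likelihood. The penalty itself is easy to write down in base R; the coefficients below are made up for illustration:

```r
# elastic-net penalty: lambda * ((1 - alpha)/2 * sum(beta^2) + alpha * sum(|beta|))
# alpha = 1 gives the lasso, alpha = 0 gives ridge regression
elastic_net_penalty <- function(beta, lambda, alpha) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}
beta <- c(0.5, -0.25, 0)
elastic_net_penalty(beta, lambda = 0.1, alpha = 1)  # pure lasso
elastic_net_penalty(beta, lambda = 0.1, alpha = 0)  # pure ridge
```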
set.seed(42)
random_train_control_grid = caret::trainControl(method="cv", number=3, search="random")
We chose a random grid search because it simplifies hyperparameter tuning and removes potential bias from an uneven depth of knowledge about the different models being tuned; in other words, we expected a random search to give a more consistent basis for comparing the tuned models.
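Conceptually, a random search samples hyperparameter candidates instead of enumerating a fixed grid. A base-R sketch for something like an SVM cost parameter (the range here is illustrative, not what caret actually draws):

```r
set.seed(42)
# draw 5 candidate cost values log-uniformly between 0.01 and 100
cost_candidates <- 10^runif(5, min = -2, max = 2)
cost_candidates
```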
set.seed(42)
tic()
svm_linear_tuned = caret::train(is_bullying~ ., data = train_balanced, method = "svmLinear", trControl = random_train_control_grid, tuneLength=5, verbose=FALSE)
svm_linear_tuned_toc = toc(quiet=T)
tic()
svm_linear_wow_tuned = caret::train(is_bullying~ ., data = wow_train_balanced, method = "svmLinear", trControl = random_train_control_grid, tuneLength=5, verbose=FALSE)
svm_linear_wow_tuned_toc = toc(quiet=T)
tic()
svm_linear_lol_tuned = caret::train(is_bullying~ ., data = lol_train_balanced, method = "svmLinear", trControl = random_train_control_grid, tuneLength=5, verbose=FALSE)
svm_linear_lol_tuned_toc = toc(quiet=T)
svm_linear_tuned_time_taken <- svm_linear_tuned_toc$toc - svm_linear_tuned_toc$tic
svm_linear_wow_tuned_time_taken <- svm_linear_wow_tuned_toc$toc - svm_linear_wow_tuned_toc$tic
svm_linear_lol_tuned_time_taken <- svm_linear_lol_tuned_toc$toc - svm_linear_lol_tuned_toc$tic
svm_tuned_pred_y = predict(svm_linear_tuned, test_balanced)
svm_tuned_wow_pred_y = predict(svm_linear_wow_tuned, wow_test_balanced)
svm_tuned_lol_pred_y = predict(svm_linear_lol_tuned, lol_test_balanced)
set.seed(42)
tic()
svm_model_poly = caret::train(is_bullying~ ., data = train_balanced, method = "svmPoly", trControl = random_train_control_grid, tuneLength=5, verbose=FALSE)
svm_poly_toc = toc(quiet=T)
tic()
svm_lol_model_poly = caret::train(is_bullying~ ., data = lol_train_balanced, method = "svmPoly", trControl = random_train_control_grid, tuneLength=5, verbose=FALSE)
svm_poly_lol_toc = toc(quiet=T)
tic()
svm_wow_model_poly = caret::train(is_bullying~ ., data = wow_train_balanced, method = "svmPoly", trControl = random_train_control_grid, tuneLength=5, verbose=FALSE)
svm_poly_wow_toc = toc(quiet=T)
svm_poly_tuned_time_taken = svm_poly_toc$toc - svm_poly_toc$tic
svm_lol_poly_tuned_time_taken = svm_poly_lol_toc$toc - svm_poly_lol_toc$tic
svm_wow_poly_tuned_time_taken = svm_poly_wow_toc$toc - svm_poly_wow_toc$tic
svm_poly_pred_y = predict(svm_model_poly, test_balanced)
svm_lol_poly_pred_y = predict(svm_lol_model_poly, lol_test_balanced)
svm_wow_poly_pred_y = predict(svm_wow_model_poly, wow_test_balanced)
set.seed(42)
tic()
svm_model_rbf = caret::train(is_bullying~ ., data = train_balanced, method = "svmRadial", trControl = random_train_control_grid, tuneLength=5, verbose=FALSE)
svm_rbf_toc = toc(quiet=T)
tic()
svm_lol_model_rbf = caret::train(is_bullying~ ., data = lol_train_balanced, method = "svmRadial", trControl = random_train_control_grid, tuneLength=5, verbose=FALSE)
svm_lol_rbf_toc = toc(quiet=T)
tic()
svm_wow_model_rbf = caret::train(is_bullying~ ., data = wow_train_balanced, method = "svmRadial", trControl = random_train_control_grid, tuneLength=5, verbose=FALSE)
svm_wow_rbf_toc = toc(quiet=T)
# summarizing the results
svm_rbf_tuned_time_taken = svm_rbf_toc$toc - svm_rbf_toc$tic
svm_wow_rbf_tuned_time_taken = svm_wow_rbf_toc$toc - svm_wow_rbf_toc$tic
svm_lol_rbf_tuned_time_taken = svm_lol_rbf_toc$toc - svm_lol_rbf_toc$tic
svm_rbf_pred_y = predict(svm_model_rbf, test_balanced)
svm_wow_rbf_pred_y = predict(svm_wow_model_rbf, wow_test_balanced)
svm_lol_rbf_pred_y = predict(svm_lol_model_rbf, lol_test_balanced)
set.seed(42)
tic()
mlp_model_tuned = caret::train(is_bullying~ ., data = train_balanced, method = "mlp", trControl = random_train_control_grid, tuneLength=5)
mlp_toc = toc(quiet=T)
tic()
mlp_lol_model_tuned = caret::train(is_bullying~ ., data = lol_train_balanced, method = "mlp", trControl = random_train_control_grid, tuneLength=5)
mlp_lol_toc = toc(quiet=T)
tic()
mlp_wow_model_tuned = caret::train(is_bullying~ ., data = wow_train_balanced, method = "mlp", trControl = random_train_control_grid, tuneLength=5)
mlp_wow_toc = toc(quiet=T)
mlp_tuned_time_taken = mlp_toc$toc - mlp_toc$tic
mlp_lol_tuned_time_taken = mlp_lol_toc$toc - mlp_lol_toc$tic
mlp_wow_tuned_time_taken = mlp_wow_toc$toc - mlp_wow_toc$tic
mlp_tuned_pred_y = predict(mlp_model_tuned, test_balanced)
mlp_lol_tuned_pred_y = predict(mlp_lol_model_tuned, lol_test_balanced)
mlp_wow_tuned_pred_y = predict(mlp_wow_model_tuned, wow_test_balanced)
set.seed(42)
tic()
glm_tuned_model = caret::train(is_bullying~., data=train_balanced , method = "glmnet" , trControl = random_train_control_grid, verbose=FALSE, tuneLength=5)
glm_tuned_toc <- toc(quiet=T)
tic()
glm_lol_tuned_model = caret::train(is_bullying~., data=lol_train_balanced , method = "glmnet" , trControl = random_train_control_grid, verbose=FALSE, tuneLength=5)
glm_lol_tuned_toc <- toc(quiet=T)
tic()
glm_wow_tuned_model = caret::train(is_bullying~., data=wow_train_balanced , method = "glmnet" , trControl = random_train_control_grid, verbose=FALSE, tuneLength=5)
glm_wow_tuned_toc <- toc(quiet=T)
glm_tuned_time_taken <- glm_tuned_toc$toc - glm_tuned_toc$tic
glm_lol_tuned_time_taken <- glm_lol_tuned_toc$toc - glm_lol_tuned_toc$tic
glm_wow_tuned_time_taken <- glm_wow_tuned_toc$toc - glm_wow_tuned_toc$tic
glm_tuned_pred_y = predict(glm_tuned_model, test_balanced)
glm_lol_tuned_pred_y = predict(glm_lol_tuned_model, lol_test_balanced)
glm_wow_tuned_pred_y = predict(glm_wow_tuned_model, wow_test_balanced)
For this evaluation, we wanted a comprehensive comparison of the models using key metrics such as recall, precision, F1 score and the time taken to train each model. We consider the F1 score the most important metric: a cyberbullying classifier should minimise both false positives (people flagged who haven’t actually bullied anybody) and false negatives (bullies who go undetected), and the F1 score captures exactly this trade-off between precision and recall. Unlike accuracy, the F1 score is also robust to uneven class distributions, which gives us flexibility in how we tweak the pre-processing.
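To make the trade-off concrete, precision, recall and F1 can be computed directly from confusion-matrix counts in base R; the counts below are taken from the untuned SVM matrix shown below (positive class 0):

```r
# counts from the untuned SVM confusion matrix (positive class = 0)
tp <- 124; fp <- 45; fn <- 28
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
round(c(precision = precision, recall = recall, f1 = f1), 4)
```

These match caret's reported Pos Pred Value and Sensitivity for that model.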
We have utilised the following to create an effective evaluation:
svm_confusion_matrix <- caret::confusionMatrix(svm_pred_y, test_balanced$is_bullying)
mlp_confusion_matrix <- caret::confusionMatrix(mlp_pred_y, test_balanced$is_bullying)
glm_confusion_matrix <- caret::confusionMatrix(glmnet_pred_y, test_balanced$is_bullying)
svm_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 124 45
## 1 28 113
##
## Accuracy : 0.7645
## 95% CI : (0.7133, 0.8106)
## No Information Rate : 0.5097
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.5299
##
## Mcnemar's Test P-Value : 0.06112
##
## Sensitivity : 0.8158
## Specificity : 0.7152
## Pos Pred Value : 0.7337
## Neg Pred Value : 0.8014
## Prevalence : 0.4903
## Detection Rate : 0.4000
## Detection Prevalence : 0.5452
## Balanced Accuracy : 0.7655
##
## 'Positive' Class : 0
##
mlp_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 117 43
## 1 35 115
##
## Accuracy : 0.7484
## 95% CI : (0.6962, 0.7957)
## No Information Rate : 0.5097
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.4971
##
## Mcnemar's Test P-Value : 0.428
##
## Sensitivity : 0.7697
## Specificity : 0.7278
## Pos Pred Value : 0.7312
## Neg Pred Value : 0.7667
## Prevalence : 0.4903
## Detection Rate : 0.3774
## Detection Prevalence : 0.5161
## Balanced Accuracy : 0.7488
##
## 'Positive' Class : 0
##
glm_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 133 57
## 1 19 101
##
## Accuracy : 0.7548
## 95% CI : (0.703, 0.8017)
## No Information Rate : 0.5097
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5118
##
## Mcnemar's Test P-Value : 2.194e-05
##
## Sensitivity : 0.8750
## Specificity : 0.6392
## Pos Pred Value : 0.7000
## Neg Pred Value : 0.8417
## Prevalence : 0.4903
## Detection Rate : 0.4290
## Detection Prevalence : 0.6129
## Balanced Accuracy : 0.7571
##
## 'Positive' Class : 0
##
When examining these confusion matrices, the number of false positives (i.e. falsely classifying a post as cyberbullying) appears to be the differentiating factor for F1 score, as the glmnet model has the fewest false positives.
svm_linear_tuned_cf_matrix <- caret::confusionMatrix(svm_tuned_pred_y, test_balanced$is_bullying)
svm_poly_cf_matrix <- caret::confusionMatrix(svm_poly_pred_y, test_balanced$is_bullying)
svm_rbf_cf_matrix <- caret::confusionMatrix(svm_rbf_pred_y, test_balanced$is_bullying)
mlp_tuned_cf_matrix <- caret::confusionMatrix(mlp_tuned_pred_y, test_balanced$is_bullying)
glm_tuned_cf_matrix <- caret::confusionMatrix(glm_tuned_pred_y, test_balanced$is_bullying)
Here we create confusion matrices from which we obtain values such as F1 score, precision and recall.
svm_linear_tuned_cf_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 126 46
## 1 26 112
##
## Accuracy : 0.7677
## 95% CI : (0.7167, 0.8136)
## No Information Rate : 0.5097
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.5365
##
## Mcnemar's Test P-Value : 0.02514
##
## Sensitivity : 0.8289
## Specificity : 0.7089
## Pos Pred Value : 0.7326
## Neg Pred Value : 0.8116
## Prevalence : 0.4903
## Detection Rate : 0.4065
## Detection Prevalence : 0.5548
## Balanced Accuracy : 0.7689
##
## 'Positive' Class : 0
##
svm_poly_cf_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 133 47
## 1 19 111
##
## Accuracy : 0.7871
## 95% CI : (0.7373, 0.8313)
## No Information Rate : 0.5097
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5755
##
## Mcnemar's Test P-Value : 0.000889
##
## Sensitivity : 0.8750
## Specificity : 0.7025
## Pos Pred Value : 0.7389
## Neg Pred Value : 0.8538
## Prevalence : 0.4903
## Detection Rate : 0.4290
## Detection Prevalence : 0.5806
## Balanced Accuracy : 0.7888
##
## 'Positive' Class : 0
##
svm_rbf_cf_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 121 47
## 1 31 111
##
## Accuracy : 0.7484
## 95% CI : (0.6962, 0.7957)
## No Information Rate : 0.5097
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.4976
##
## Mcnemar's Test P-Value : 0.08943
##
## Sensitivity : 0.7961
## Specificity : 0.7025
## Pos Pred Value : 0.7202
## Neg Pred Value : 0.7817
## Prevalence : 0.4903
## Detection Rate : 0.3903
## Detection Prevalence : 0.5419
## Balanced Accuracy : 0.7493
##
## 'Positive' Class : 0
##
mlp_tuned_cf_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 112 41
## 1 40 117
##
## Accuracy : 0.7387
## 95% CI : (0.6861, 0.7867)
## No Information Rate : 0.5097
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.4773
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7368
## Specificity : 0.7405
## Pos Pred Value : 0.7320
## Neg Pred Value : 0.7452
## Prevalence : 0.4903
## Detection Rate : 0.3613
## Detection Prevalence : 0.4935
## Balanced Accuracy : 0.7387
##
## 'Positive' Class : 0
##
glm_tuned_cf_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 122 45
## 1 30 113
##
## Accuracy : 0.7581
## 95% CI : (0.7064, 0.8047)
## No Information Rate : 0.5097
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5169
##
## Mcnemar's Test P-Value : 0.106
##
## Sensitivity : 0.8026
## Specificity : 0.7152
## Pos Pred Value : 0.7305
## Neg Pred Value : 0.7902
## Prevalence : 0.4903
## Detection Rate : 0.3935
## Detection Prevalence : 0.5387
## Balanced Accuracy : 0.7589
##
## 'Positive' Class : 0
##
svm_wow_confusion_matrix <- caret::confusionMatrix(svm_wow_pred_y, wow_test_balanced$is_bullying)
mlp_wow_confusion_matrix <- caret::confusionMatrix(mlp_wow_pred_y, wow_test_balanced$is_bullying)
glm_wow_confusion_matrix <- caret::confusionMatrix(glmnet_wow_pred_y, wow_test_balanced$is_bullying)
svm_wow_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 37 21
## 1 8 54
##
## Accuracy : 0.7583
## 95% CI : (0.6717, 0.8318)
## No Information Rate : 0.625
## P-Value [Acc > NIR] : 0.00133
##
## Kappa : 0.5126
##
## Mcnemar's Test P-Value : 0.02586
##
## Sensitivity : 0.8222
## Specificity : 0.7200
## Pos Pred Value : 0.6379
## Neg Pred Value : 0.8710
## Prevalence : 0.3750
## Detection Rate : 0.3083
## Detection Prevalence : 0.4833
## Balanced Accuracy : 0.7711
##
## 'Positive' Class : 0
##
mlp_wow_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 35 22
## 1 10 53
##
## Accuracy : 0.7333
## 95% CI : (0.6449, 0.8099)
## No Information Rate : 0.625
## P-Value [Acc > NIR] : 0.008121
##
## Kappa : 0.4599
##
## Mcnemar's Test P-Value : 0.051830
##
## Sensitivity : 0.7778
## Specificity : 0.7067
## Pos Pred Value : 0.6140
## Neg Pred Value : 0.8413
## Prevalence : 0.3750
## Detection Rate : 0.2917
## Detection Prevalence : 0.4750
## Balanced Accuracy : 0.7422
##
## 'Positive' Class : 0
##
glm_wow_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 43 36
## 1 2 39
##
## Accuracy : 0.6833
## 95% CI : (0.5922, 0.7652)
## No Information Rate : 0.625
## P-Value [Acc > NIR] : 0.1093
##
## Kappa : 0.4131
##
## Mcnemar's Test P-Value : 8.636e-08
##
## Sensitivity : 0.9556
## Specificity : 0.5200
## Pos Pred Value : 0.5443
## Neg Pred Value : 0.9512
## Prevalence : 0.3750
## Detection Rate : 0.3583
## Detection Prevalence : 0.6583
## Balanced Accuracy : 0.7378
##
## 'Positive' Class : 0
##
When examining these confusion matrices, the number of false positives (i.e. falsely classifying a post as cyberbullying) again appears to be the differentiating factor for F1 score, as the glmnet model has the fewest false positives.
svm_linear_tuned_cf_matrix <- caret::confusionMatrix(svm_tuned_wow_pred_y, wow_test_balanced$is_bullying)
svm_poly_cf_matrix <- caret::confusionMatrix(svm_wow_poly_pred_y, wow_test_balanced$is_bullying)
svm_rbf_cf_matrix <- caret::confusionMatrix(svm_wow_rbf_pred_y, wow_test_balanced$is_bullying)
mlp_tuned_cf_matrix <- caret::confusionMatrix(mlp_wow_tuned_pred_y, wow_test_balanced$is_bullying)
glm_tuned_cf_matrix <- caret::confusionMatrix(glm_wow_tuned_pred_y, wow_test_balanced$is_bullying)
Here we create confusion matrices from which we obtain values such as F1 score, precision and recall.
svm_linear_tuned_cf_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 37 19
## 1 8 56
##
## Accuracy : 0.775
## 95% CI : (0.6898, 0.8462)
## No Information Rate : 0.625
## P-Value [Acc > NIR] : 0.000323
##
## Kappa : 0.5424
##
## Mcnemar's Test P-Value : 0.054292
##
## Sensitivity : 0.8222
## Specificity : 0.7467
## Pos Pred Value : 0.6607
## Neg Pred Value : 0.8750
## Prevalence : 0.3750
## Detection Rate : 0.3083
## Detection Prevalence : 0.4667
## Balanced Accuracy : 0.7844
##
## 'Positive' Class : 0
##
svm_poly_cf_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 37 21
## 1 8 54
##
## Accuracy : 0.7583
## 95% CI : (0.6717, 0.8318)
## No Information Rate : 0.625
## P-Value [Acc > NIR] : 0.00133
##
## Kappa : 0.5126
##
## Mcnemar's Test P-Value : 0.02586
##
## Sensitivity : 0.8222
## Specificity : 0.7200
## Pos Pred Value : 0.6379
## Neg Pred Value : 0.8710
## Prevalence : 0.3750
## Detection Rate : 0.3083
## Detection Prevalence : 0.4833
## Balanced Accuracy : 0.7711
##
## 'Positive' Class : 0
##
svm_rbf_cf_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 38 21
## 1 7 54
##
## Accuracy : 0.7667
## 95% CI : (0.6807, 0.839)
## No Information Rate : 0.625
## P-Value [Acc > NIR] : 0.00067
##
## Kappa : 0.5314
##
## Mcnemar's Test P-Value : 0.01402
##
## Sensitivity : 0.8444
## Specificity : 0.7200
## Pos Pred Value : 0.6441
## Neg Pred Value : 0.8852
## Prevalence : 0.3750
## Detection Rate : 0.3167
## Detection Prevalence : 0.4917
## Balanced Accuracy : 0.7822
##
## 'Positive' Class : 0
##
mlp_tuned_cf_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 34 19
## 1 11 56
##
## Accuracy : 0.75
## 95% CI : (0.6627, 0.8245)
## No Information Rate : 0.625
## P-Value [Acc > NIR] : 0.002533
##
## Kappa : 0.485
##
## Mcnemar's Test P-Value : 0.201243
##
## Sensitivity : 0.7556
## Specificity : 0.7467
## Pos Pred Value : 0.6415
## Neg Pred Value : 0.8358
## Prevalence : 0.3750
## Detection Rate : 0.2833
## Detection Prevalence : 0.4417
## Balanced Accuracy : 0.7511
##
## 'Positive' Class : 0
##
It would appear that tuning MLP parameters through random grid search has actually decreased the accuracy of the model.
glm_tuned_cf_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 42 25
## 1 3 50
##
## Accuracy : 0.7667
## 95% CI : (0.6807, 0.839)
## No Information Rate : 0.625
## P-Value [Acc > NIR] : 0.00067
##
## Kappa : 0.5466
##
## Mcnemar's Test P-Value : 7.229e-05
##
## Sensitivity : 0.9333
## Specificity : 0.6667
## Pos Pred Value : 0.6269
## Neg Pred Value : 0.9434
## Prevalence : 0.3750
## Detection Rate : 0.3500
## Detection Prevalence : 0.5583
## Balanced Accuracy : 0.8000
##
## 'Positive' Class : 0
##
svm_lol_confusion_matrix <- caret::confusionMatrix(svm_lol_pred_y, lol_test_balanced$is_bullying)
mlp_lol_confusion_matrix <- caret::confusionMatrix(mlp_lol_pred_y, lol_test_balanced$is_bullying)
glm_lol_confusion_matrix <- caret::confusionMatrix(glmnet_lol_pred_y, lol_test_balanced$is_bullying)
svm_lol_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 70 36
## 1 16 68
##
## Accuracy : 0.7263
## 95% CI : (0.6571, 0.7884)
## No Information Rate : 0.5474
## P-Value [Acc > NIR] : 2.986e-07
##
## Kappa : 0.4586
##
## Mcnemar's Test P-Value : 0.008418
##
## Sensitivity : 0.8140
## Specificity : 0.6538
## Pos Pred Value : 0.6604
## Neg Pred Value : 0.8095
## Prevalence : 0.4526
## Detection Rate : 0.3684
## Detection Prevalence : 0.5579
## Balanced Accuracy : 0.7339
##
## 'Positive' Class : 0
##
mlp_lol_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 60 31
## 1 26 73
##
## Accuracy : 0.7
## 95% CI : (0.6294, 0.7642)
## No Information Rate : 0.5474
## P-Value [Acc > NIR] : 1.195e-05
##
## Kappa : 0.3976
##
## Mcnemar's Test P-Value : 0.5962
##
## Sensitivity : 0.6977
## Specificity : 0.7019
## Pos Pred Value : 0.6593
## Neg Pred Value : 0.7374
## Prevalence : 0.4526
## Detection Rate : 0.3158
## Detection Prevalence : 0.4789
## Balanced Accuracy : 0.6998
##
## 'Positive' Class : 0
##
glm_lol_confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 77 39
## 1 9 65
##
## Accuracy : 0.7474
## 95% CI : (0.6794, 0.8075)
## No Information Rate : 0.5474
## P-Value [Acc > NIR] : 9.858e-09
##
## Kappa : 0.5051
##
## Mcnemar's Test P-Value : 2.842e-05
##
## Sensitivity : 0.8953
## Specificity : 0.6250
## Pos Pred Value : 0.6638
## Neg Pred Value : 0.8784
## Prevalence : 0.4526
## Detection Rate : 0.4053
## Detection Prevalence : 0.6105
## Balanced Accuracy : 0.7602
##
## 'Positive' Class : 0
##
When examining these confusion matrices, the number of false positives (i.e. falsely classifying a post as cyberbullying) is again the differentiating factor for F1 score, as the glmnet model has the fewest false positives.
svm_lol_linear_tuned_cf_matrix <- caret::confusionMatrix(svm_tuned_lol_pred_y, lol_test_balanced$is_bullying)
svm_lol_poly_cf_matrix <- caret::confusionMatrix(svm_lol_poly_pred_y, lol_test_balanced$is_bullying)
svm_lol_rbf_cf_matrix <- caret::confusionMatrix(svm_lol_rbf_pred_y, lol_test_balanced$is_bullying)
mlp_lol_tuned_cf_matrix <- caret::confusionMatrix(mlp_lol_tuned_pred_y, lol_test_balanced$is_bullying)
glm_lol_tuned_cf_matrix <- caret::confusionMatrix(glm_lol_tuned_pred_y, lol_test_balanced$is_bullying)
Here we create confusion matrices from which we obtain values such as F1 score, precision and recall.
svm_lol_linear_tuned_cf_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 63 40
## 1 23 64
##
## Accuracy : 0.6684
## 95% CI : (0.5966, 0.7349)
## No Information Rate : 0.5474
## P-Value [Acc > NIR] : 0.0004544
##
## Kappa : 0.3421
##
## Mcnemar's Test P-Value : 0.0438198
##
## Sensitivity : 0.7326
## Specificity : 0.6154
## Pos Pred Value : 0.6117
## Neg Pred Value : 0.7356
## Prevalence : 0.4526
## Detection Rate : 0.3316
## Detection Prevalence : 0.5421
## Balanced Accuracy : 0.6740
##
## 'Positive' Class : 0
##
svm_lol_poly_cf_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 54 47
## 1 32 57
##
## Accuracy : 0.5842
## 95% CI : (0.5106, 0.6551)
## No Information Rate : 0.5474
## P-Value [Acc > NIR] : 0.1718
##
## Kappa : 0.1734
##
## Mcnemar's Test P-Value : 0.1152
##
## Sensitivity : 0.6279
## Specificity : 0.5481
## Pos Pred Value : 0.5347
## Neg Pred Value : 0.6404
## Prevalence : 0.4526
## Detection Rate : 0.2842
## Detection Prevalence : 0.5316
## Balanced Accuracy : 0.5880
##
## 'Positive' Class : 0
##
svm_lol_rbf_cf_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 34 21
## 1 52 83
##
## Accuracy : 0.6158
## 95% CI : (0.5426, 0.6853)
## No Information Rate : 0.5474
## P-Value [Acc > NIR] : 0.033684
##
## Kappa : 0.1997
##
## Mcnemar's Test P-Value : 0.000446
##
## Sensitivity : 0.3953
## Specificity : 0.7981
## Pos Pred Value : 0.6182
## Neg Pred Value : 0.6148
## Prevalence : 0.4526
## Detection Rate : 0.1789
## Detection Prevalence : 0.2895
## Balanced Accuracy : 0.5967
##
## 'Positive' Class : 0
##
mlp_lol_tuned_cf_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 66 30
## 1 20 74
##
## Accuracy : 0.7368
## 95% CI : (0.6682, 0.7979)
## No Information Rate : 0.5474
## P-Value [Acc > NIR] : 5.718e-08
##
## Kappa : 0.4742
##
## Mcnemar's Test P-Value : 0.2031
##
## Sensitivity : 0.7674
## Specificity : 0.7115
## Pos Pred Value : 0.6875
## Neg Pred Value : 0.7872
## Prevalence : 0.4526
## Detection Rate : 0.3474
## Detection Prevalence : 0.5053
## Balanced Accuracy : 0.7395
##
## 'Positive' Class : 0
##
It would appear that tuning MLP parameters through random grid search has actually decreased the accuracy of the model.
glm_lol_tuned_cf_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 71 34
## 1 15 70
##
## Accuracy : 0.7421
## 95% CI : (0.6738, 0.8027)
## No Information Rate : 0.5474
## P-Value [Acc > NIR] : 2.406e-08
##
## Kappa : 0.4893
##
## Mcnemar's Test P-Value : 0.01013
##
## Sensitivity : 0.8256
## Specificity : 0.6731
## Pos Pred Value : 0.6762
## Neg Pred Value : 0.8235
## Prevalence : 0.4526
## Detection Rate : 0.3737
## Detection Prevalence : 0.5526
## Balanced Accuracy : 0.7493
##
## 'Positive' Class : 0
##
plot_roc_curves <- function(model_1_pred_y, model_2_pred_y, model_3_pred_y, model_names, test_data) {
#convert factor predictions back to their numeric 0/1 labels for pROC
model_1_pred_y <- as.numeric(levels(model_1_pred_y))[model_1_pred_y]
model_2_pred_y <- as.numeric(levels(model_2_pred_y))[model_2_pred_y]
model_3_pred_y <- as.numeric(levels(model_3_pred_y))[model_3_pred_y]
par(pty="s") #square plotting region
model_1_roc <- roc(test_data$is_bullying~model_1_pred_y, plot=TRUE, print.auc=TRUE, col="red", lwd=4, legacy.axes=TRUE, main="ROC Curves")
model_2_roc <- roc(test_data$is_bullying~model_2_pred_y, plot=TRUE, print.auc=TRUE, print.auc.y=0.4, col="blue", lwd=4, add=TRUE)
model_3_roc <- roc(test_data$is_bullying~model_3_pred_y, plot=TRUE, print.auc=TRUE, print.auc.y=0.6, col="green", lwd=4, add=TRUE)
legend("bottomright", legend=model_names, col=c("red", "blue", "green"), lwd=4)
}
plot_roc_curves(svm_pred_y, mlp_pred_y, glmnet_pred_y, c("SVM", "MLP", "GLMNET"), test_balanced)
plot_roc_curves(svm_tuned_pred_y, svm_poly_pred_y, svm_rbf_pred_y, c("SVM Linear", "SVM Poly", "SVM RBF"), test_balanced)
It would appear that the polynomial kernel is the most suitable for bullying classification on this dataset. We will compare this polynomial-kernel SVM against the other model types.
plot_roc_curves(svm_poly_pred_y, mlp_tuned_pred_y, glm_tuned_pred_y, c("SVM Poly", "MLP Tuned", "GLMNET"), test_balanced)
plot_roc_curves(svm_wow_pred_y, mlp_wow_pred_y, glmnet_wow_pred_y, c("SVM", "MLP", "GLMNET"), wow_test_balanced)
It would appear that the glmnet model is very effective on the World of Warcraft dataset, as it has much greater specificity than the other models; however, MLP has a slightly higher AUC.
plot_roc_curves(svm_tuned_wow_pred_y, svm_wow_poly_pred_y, svm_wow_rbf_pred_y, c("SVM Linear", "SVM Poly", "SVM RBF"), wow_test_balanced)
It would appear that the RBF model has the greatest AUC. We can compare this model with other models.
plot_roc_curves(svm_wow_rbf_pred_y, mlp_wow_tuned_pred_y, glm_wow_tuned_pred_y, c("SVM RBF", "MLP Tuned", "GLMNET"), wow_test_balanced)
plot_roc_curves(svm_lol_pred_y, mlp_lol_pred_y, glmnet_lol_pred_y, c("SVM", "MLP", "GLMNET"), lol_test_balanced)
It appears that glmnet consistently has good specificity when untuned, giving it a decent AUC score.
plot_roc_curves(svm_tuned_lol_pred_y, svm_lol_poly_pred_y, svm_lol_rbf_pred_y, c("SVM Linear", "SVM Poly", "SVM RBF"), lol_test_balanced)
It would appear that the linear kernel is most suitable for bullying classification. We will compare this SVM model that uses the linear kernel with other models.
plot_roc_curves(svm_tuned_lol_pred_y, mlp_lol_tuned_pred_y, glm_lol_tuned_pred_y, c("SVM Linear", "MLP Tuned", "GLMNET"), lol_test_balanced)
It would appear that MLP has the greatest AUC score when tuned.
For each model, we calculate our chosen evaluation metrics (time taken, precision, recall, F1 score, AUC) and collect them into a table for comparison. We record the time taken to train each model to judge how practical it would be if the models had to be retrained continuously on new data; while training on our sample data is quick, it would take much longer with more data.
create_table <- function(prediction_list, time_to_train_c, test_data) {
  Model_Name <- names(prediction_list)
  Precision <- c()
  Recall <- c()
  F1_Score <- c()
  AUC <- c()
  Time_to_Train_secs <- time_to_train_c
  # Compute each metric against the true labels for every model's predictions
  for (prediction_y in prediction_list) {
    Precision <- append(Precision, precision(prediction_y, test_data$is_bullying))
    Recall <- append(Recall, recall(prediction_y, test_data$is_bullying))
    F1_Score <- append(F1_Score, F_meas(prediction_y, test_data$is_bullying))
    AUC <- append(AUC, auc(test_data$is_bullying, as.integer(prediction_y)))
  }
  # Collect one row per model into a single comparison table
  results <- data.frame(Model_Name, Precision, Recall, F1_Score, AUC, Time_to_Train_secs)
  return(results)
}
prediction_list = list(svm_linear = svm_tuned_pred_y, svm_poly = svm_poly_pred_y, svm_rbf = svm_rbf_pred_y, mlp = mlp_tuned_pred_y, glmnet = glm_tuned_pred_y)
time_to_train = c(svm_time_taken, svm_poly_tuned_time_taken, svm_rbf_tuned_time_taken, mlp_time_taken, glmnet_time_taken)
results = create_table(prediction_list, time_to_train, test_balanced)
results
From these results, we can see that the svm_poly model had the greatest F1 score and the second greatest AUC, where it appears roughly as good as glmnet. Choosing between the two may come down to prioritising either precision or recall. Both models are also relatively quick to train, which can reduce production costs.
We can also see that the mlp model performed the worst: it took a long time to train on a relatively small sample of data (almost two minutes) and it has the worst F1 and AUC scores. However, it does have the highest precision, at the cost of a relatively poor recall.
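The trade-off above is why F1 favours balanced models: it is the harmonic mean of precision and recall, so of two models whose precision and recall have the same average, the more balanced pair scores higher. A quick illustrative check (the values here are made up, not taken from our results):

```r
# F1 as the harmonic mean of precision (p) and recall (r)
f1 <- function(p, r) 2 * p * r / (p + r)

f1(0.9, 0.5)  # imbalanced pair: ~0.643
f1(0.7, 0.7)  # balanced pair with the same arithmetic mean: 0.7
```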
prediction_list = list(svm_linear = svm_tuned_wow_pred_y, svm_poly = svm_wow_poly_pred_y, svm_rbf = svm_wow_rbf_pred_y, mlp = mlp_wow_tuned_pred_y, glmnet = glm_wow_tuned_pred_y)
time_to_train = c(svm_wow_time_taken, svm_wow_poly_tuned_time_taken, svm_wow_rbf_tuned_time_taken, mlp_wow_time_taken, glmnet_wow_time_taken)
wow_results = create_table(prediction_list, time_to_train, wow_test_balanced)
wow_results
The World of Warcraft results show a similar pattern: the svm_poly model again achieves the greatest F1 score and is roughly as good as glmnet on AUC, while the mlp model again takes by far the longest to train and records the worst F1 and AUC scores despite having the highest precision.
prediction_list = list(svm_linear = svm_tuned_lol_pred_y, svm_poly = svm_lol_poly_pred_y, svm_rbf = svm_lol_rbf_pred_y, mlp = mlp_lol_tuned_pred_y, glmnet = glm_lol_tuned_pred_y)
time_to_train = c(svm_lol_time_taken, svm_lol_poly_tuned_time_taken, svm_lol_rbf_tuned_time_taken, mlp_lol_time_taken, glmnet_lol_time_taken)
lol_results = create_table(prediction_list, time_to_train, lol_test_balanced)
lol_results
From these results, we can see that the svm_poly model has the greatest F1 score, whilst svm_linear has a slightly higher AUC. However, svm_poly has the worst precision, whereas glmnet has the highest. We could choose the glmnet model, as its precision and recall are closer together, making it a more balanced classifier.
### Output results
write.csv(lol_results, "Data/lol_results.csv")
write.csv(wow_results, "Data/wow_results.csv")
write.csv(results, "Data/results.csv")
We output each table of results to a CSV file to make it easy to share the results with the other group members.
Thomas Newton
I was responsible for predictive modelling and hyperparameter tuning. I also created the problem statements and references.
Leon Harper
I was responsible for importing the data (converting it from an SQL dump file into corresponding CSV files), cleaning the data and evaluating the performance of the relevant models. I also helped manage the project through a Trello board and supported team members who were struggling.
Michal Jedruszczak
In this project I was responsible for the preliminary EDA and the final EDA. I also created the contents section and made sure that all of the references were in MMU Harvard format.
Anti-Defamation League (2022) Hate is No Game: Harassment and Positive Social Experiences in Online Games 2021. [Online] [Accessed on 19th November 2022] https://www.adl.org/resources/report/hate-no-game-harassment-and-positive-social-experiences-online-games-2021
Bretschneider, U. and Peters, R. (2016) 'Detecting Cyberbullying in Online Communities.' ECIS 2016 Research Papers, Paper 61. [Online] http://aisel.aisnet.org/ecis2016_rp/61